Introduction

I chose to investigate an Uber dataset of New York City pickups from January through June of 2015. The Taxi & Limousine Commission (TLC) released the data after receiving a Freedom of Information Law (FOIL) request from FiveThirtyEight on June 22, 2015. I pulled the data from the FiveThirtyEight GitHub repo (https://github.com/fivethirtyeight/uber-tlc-foil-response). Each observation represents a single pickup with the following columns:

Variable              Definition
Dispatching_base_num  The TLC base company code that dispatched the Uber
Pickup_date           The full date of the pickup in yyyy-mm-dd h:m:s format
Affiliated_base_num   The TLC base company code of the Uber pickup
locationID            The pickup location ID of the Uber pickup
Borough               The New York City Borough where the pickup took place
Zone                  The neighborhood in the New York City Borough where the pickup took place
lat                   The latitude of the pickup Zone
lon                   The longitude of the pickup Zone
month                 The month that the pickup took place
date                  The date (1-31) that the pickup took place
day                   The day (Sunday-Saturday) that the pickup took place
hour                  The hour that the pickup took place

I was most interested in how ridership varied over time and location. I wanted to know how it changed by hour, day, week and month. I also wanted to know which Boroughs and Zones were the most popular.

Univariate Plots Section

Hourly Pickups

The graph above shows the total number of pickups per hour. There are 3 notable spikes at midnight (0), 8am and 7pm. There is also a small bump around noon for lunch.

Daily Pickups

Pickups per day followed an expected trend, with demand rising Monday through Friday, then peaking on the weekend. It’s also interesting to note the differences in ranges for each day. I expected Friday’s variation to be larger than that of any other day, but instead its range was tighter and its pickup counts were consistently higher.

Monthly Pickups

Demand over this 6-month span rose from a little under 2 million rides in January to nearly 3 million in June. This may have been due to an increase in popularity or warmer weather or both, but more data is needed to make generalizations on monthly patterns.

Pickups by Location

The largest variation was in the location data. Staten Island had very few pickups per zone, with most falling in the low hundreds. Manhattan had the most pickups, with some zones nearing 500,000. I used a log scale for the y-axis because the data was so spread out.
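
To illustrate why the log scale helps, here is a minimal sketch using ggplot2. The data frame `zone_totals` and its values are made up for illustration, and the boxplot geometry is an assumption, since the original plot code is not shown.

```r
library(ggplot2)

# Hypothetical per-zone pickup totals; the numbers are illustrative only.
zone_totals <- data.frame(
  Borough = c("Manhattan", "Manhattan", "Brooklyn", "Brooklyn",
              "Staten Island", "Staten Island"),
  pickups = c(460732, 120000, 45000, 3000, 150, 90)
)

# A log10 y-axis keeps zones with ~100 pickups readable next to
# zones with hundreds of thousands of pickups.
p <- ggplot(zone_totals, aes(x = Borough, y = pickups)) +
  geom_boxplot() +
  scale_y_log10()
```

On a linear axis the Staten Island boxes would collapse onto the baseline; on the log axis each order of magnitude gets the same vertical space.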

Univariate Analysis

The Uber dataset consists of 14 million observations of pickup data spread across 12 variables. Originally there were only 4 variables (dispatch base number, pickup date, affiliated base number, and location ID), but I joined that with taxi lookup data to get Zone and Borough names for each location ID. I also wanted to map each location to a set of coordinates, so I used Google’s API to get the latitude and longitude for each zone. The Google API limits users to 2,500 queries per 24-hour period, so I had to be strategic when joining the lookup data to the ridership data.
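
The join described above might look roughly like the sketch below. It assumes the coordinates are attached to the zone lookup table first (one geocoding query per zone rather than per ride) and then joined onto the pickups with dplyr. The table names `pickups` and `zones`, and every value in them, are placeholders rather than the real data.

```r
library(dplyr)

# Toy stand-in for the raw pickup records (values are made up).
pickups <- tibble(
  Pickup_date = c("2015-01-01 00:05:00", "2015-01-01 00:10:00",
                  "2015-01-01 00:12:00", "2015-01-01 00:20:00"),
  locationID  = c(1L, 2L, 2L, 99L)
)

# Toy stand-in for the zone lookup, already extended with coordinates.
# The lat/lon values are placeholders, not real geocoding results.
zones <- tibble(
  locationID = c(1L, 2L),
  Borough    = c("Borough A", "Borough B"),
  Zone       = c("Zone A", "Zone B"),
  lat        = c(40.70, 40.80),
  lon        = c(-74.00, -73.90)
)

# Coordinates are needed once per zone, not once per ride, which is
# what keeps the number of geocoding queries small.
n_queries <- n_distinct(zones$Zone)

# left_join keeps every pickup; IDs missing from the lookup get NA.
joined <- left_join(pickups, zones, by = "locationID")
```

Because the lookup has only one row per zone, the number of geocoding queries stays far below the daily limit no matter how many rides are joined afterwards.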

I parsed the month, date, day, and hour from each pickup timestamp using lubridate to get a more granular analysis of patterns over time. I also binned the pickup data by location and time for some of the histograms and boxplots.

I had a dilemma when it came to the taxi lookup dataset because some of the location data was missing. It was a small subset (about 6,000 observations) but I did not want to throw away the data. Instead I kept all observations for the time period analyses and excluded them from the location analyses.
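
A minimal sketch of the timestamp parsing and the handling of missing locations is shown below. The column names follow the variable table in the introduction; the three sample rows and the object names `rides` and `location_rides` are assumptions made for illustration.

```r
library(dplyr)
library(lubridate)

# Three made-up rows; the middle one has no matching location.
rides <- tibble(
  Pickup_date = c("2015-05-16 23:05:00", "2015-06-27 08:30:00",
                  "2015-01-27 00:15:00"),
  Borough     = c("Manhattan", NA, "Queens")
)

# Parse the timestamp once, then derive the time parts from it.
rides <- rides %>%
  mutate(
    Pickup_date = ymd_hms(Pickup_date),
    month = lubridate::month(Pickup_date, label = TRUE),
    date  = lubridate::mday(Pickup_date),
    day   = lubridate::wday(Pickup_date, label = TRUE),
    hour  = lubridate::hour(Pickup_date)
  )

# Time-based analyses use every row; location-based analyses
# drop the rows whose location could not be matched.
location_rides <- filter(rides, !is.na(Borough))
```

Keeping two views of the same table avoids throwing away rides that are still perfectly valid for the hourly, daily and monthly counts.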

Since the dataset was so large, I ran most tests on a sample of 100,000 randomly selected observations before making the final plots on the full dataset. This saved a lot of time since some plots took a couple of minutes (each) to run.
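
One way to draw such a sample is sketched below. It uses dplyr's `slice_sample`, which is an assumption about tooling (base R's `sample` would work equally well), and a synthetic `rides` table standing in for the real data.

```r
library(dplyr)

set.seed(42)  # fixed seed so the same sample is drawn on every run

# Synthetic stand-in for the full ride table.
rides <- tibble(hour = sample(0:23, 500000, replace = TRUE))

# Draw 100,000 rows without replacement for quick, iterative plotting.
rides_sample <- slice_sample(rides, n = 100000)
```

Plots are prototyped against `rides_sample` and only rerun against `rides` once the layout is settled.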

The most unusual distributions came from the location data. Some boroughs only had a few hundred pickups while others were in the tens of thousands and even hundreds of thousands. This was unusual because, as I saw later, these patterns did not change over the course of the week. Manhattan was the busiest location during the week and on the weekends.

Even more unusual was the fact that Manhattan (pop 1.6 million) is not the most populous Borough in New York. Brooklyn (2.6 million) and Queens (2.3 million) have considerably larger populations. However, the population density of Manhattan is twice that of Brooklyn and triple that of Queens, which may help explain its dominance. This could be worth investigating for Uber as they expand into more markets. It would be interesting to look at ridership data for other cities to look for correlations.

I used the dplyr and tidyr packages to reshape the data. This helped parse the time data to get month, date, day and hour variables. It also helped when I needed to bin and summarise observations to get the number of rides by time period and location. One day I hope to shake Hadley Wickham’s hand for what he has done for the data science community. He is truly a prolific developer and I feel a certain level of comfort when I see his name in a package’s documentation.

Bivariate Plots

Hourly Pickups

This shows the distribution of pickups throughout the week. It’s easier to see some of the more nuanced patterns that emerge between the days. For instance, the 8am peak is only prevalent Monday through Friday, most likely due to morning rush hour traffic. The evening peak shifts slightly from 6 pm to 7 pm throughout the week and its duration lengthens as passengers edge closer to the weekend. Instead of dropping off promptly after work, demand extends through midnight as the night life heats up.

This shift in rider behavior doesn’t begin abruptly on Friday or even Thursday night as I would expect, but begins as early as Tuesday as workers loosen their ties in the evening. Interestingly, the 8am peak remains unchanged throughout the week at a steady 100,000. People are not rushing to work, but they are rushing to the bar in larger numbers as the week presses on.

Saturday and Sunday behavior removes the 8am peak for obvious reasons and instead, demand grows steadily from 6am through midnight. The growth on Saturday extends through this entire time period while Sunday demand tapers off around 6pm as riders prepare for the coming work week.

Monthly Pickups Colored by Day

This graph gave me some of the most interesting information about overall rider behavior. It shows both long-term increases in demand as well as more cyclical patterns during the week. Demand started small in January with most days averaging about 40,000 - 70,000 pickups. February saw an uptick in the number of 70,000+ days and June consistently reached over 100,000 rides each weekend. It is unclear what exactly caused the increase in demand, but it would be interesting to examine year-over-year Uber data to see how rider behavior changed from winter to spring.

Weekly patterns were very interesting. Each week saw the same (relative) pattern. Monday was usually the least popular day, with demand rising steadily before climaxing on Saturday. Interestingly, Sunday was a more popular day than Monday, which could have been partly due to the extra runoff when Saturday night turned into Sunday early morning.

Understanding this weekly cycle piqued my interest in weeks that did not follow the pattern. The week of May 17-24 was particularly interesting because the climax was on a Thursday and demand actually dropped on Friday and Saturday. I tried looking up historical data from the area during that time period but was unable to find anything to explain the strange shift in rider behavior.

This also helped identify outliers. Of particular note were Saturday, May 16 and Saturday, June 27. Both days saw demand surge past 100,000 riders. It seemed worth investigating to understand why ridership increased so much and possibly how to replicate those results. The section below dives into May and June in more depth.

May and June In-Depth

I wanted to understand how demand changed throughout May and June. In particular, I was interested in whether higher demand days had peaks that were fundamentally different or whether they resulted from the same demand curves. I found a little bit of both. May 16th started the same as any other Saturday, but instead of demand plateauing around 7pm, it continued to rise through 11 pm. This spike in demand was most likely due to the festivals taking place in the area that weekend. I was able to find a notice from the NYC Police Department announcing street closures that weekend due to a number of local festivals and a marathon taking place.

The same patterns emerged on June 27th. It started the same as any other Saturday but demand spiked in the afternoon, topping 10,000 rides for a single hour at 11pm. I was able to find another notice from the NYC Police Department that seemed to explain the huge surge in pickups. Another mixture of local festivals and a marathon was responsible for the increase. And again, demand on Sunday, Monday and Tuesday afterwards was relatively unchanged. I’d be very curious to see how demand was affected on Friday, July 3rd and Saturday, July 4th. I feel conflicted because demand should probably drop on those days, but with July 4th being a major holiday I would expect a good-sized surge in ridership. Unfortunately, there was no available data for the month of July.

Seeing these patterns led me to believe there may be at least two distinct populations of riders. There was the Monday-Friday crowd that regularly used the service to get to work but there was also a second, wilder population that liked to party when night fell and on the weekends. This second population was noticeably absent on May 22 and May 23, probably because they were spent from the weekend before. Demand did not suddenly drop off on the Sunday-Thursday of the following week, but it did on the following Friday and Saturday. The expected peaks from 6pm - 11pm simply did not happen on May 22 or May 23.

May and June Outliers

I wanted to take a more in-depth look at how ridership changed on the two biggest Saturdays in the dataset (May 16th and June 27th). I was also interested in how it changed on the lowest Saturday (May 23rd). All 3 days had above-average turnout in the early morning, but it’s what happened later in the day that had the biggest impact on the numbers. While demand dropped below the average on May 23rd after 10am, it was consistently above normal throughout the day on both May 16th and June 27. This makes sense since demand increases throughout the day after 10am on Saturdays in general.

Bivariate Analysis

The May and June faceted histograms helped me understand demand changes over time. Although it was not a particularly eventful day, May 25th caught my attention because it didn’t look like any other Monday in the graph. The usual 8am peak was curiously absent and although it was the least busy day in May and June, it started higher than all of the other Mondays. It turns out that was Memorial Day. Again, not a particularly interesting day but still helpful for extrapolating information about other Monday holidays.

Ridership varied dramatically from week to week over the 6 month period but a few patterns remained constant. Demand was consistently greatest on Friday and Saturday of each week with few exceptions. There were even a few super Saturdays that rose above others. I think that Uber and their drivers could really benefit from taking a closer look at some of this data. For instance, the two largest peaks both happened to coincide with large festivals and marathons. Uber as an organization could benefit from forming city partnerships to promote more activities as they seem to lead to surges in demand.

Uber drivers could benefit from some of this data. As a former driver, I understand the struggle to create a feasible work schedule. Because they are independent contractors, they set their own hours so they are entirely responsible for how much they can earn each day. This can also lead to burnout, however, because it’s often difficult to predict demand without experience.

Multivariate Plots Section

Map Scatter Plot

Most rides originated from Manhattan, with a few outliers in Brooklyn and Queens. I pulled the latitude and longitude data using the Google Maps API and merged them with the Uber pickup data to generate the map. I used ggmap to create the plot and grabbed the background from

Heatmap by month and date

This heatmap highlights overall changes in ridership over time. It is easy to see the busiest (June 27th) and slowest (Jan 27th) days during the 6-month time period. I used dplyr’s group_by and summarise functions to bin the ride data by date, then used geom_tile to create the heatmap. I played around with scale_fill_gradientn to get an appropriate-looking color palette. I had to reverse the scaling because originally the higher values were blue and the lower ones were red.
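
The binning and heatmap steps could be sketched as follows. The `rides` table is synthetic, and the three-color palette is an assumption rather than the palette used in the original plot; only the function names (`group_by`, `summarise`, `geom_tile`, `scale_fill_gradientn`) come from the text.

```r
library(dplyr)
library(ggplot2)

set.seed(1)

# Synthetic stand-in: one row per ride with its month and day of month.
rides <- tibble(
  month = sample(c("Jan", "Feb", "Mar"), 5000, replace = TRUE),
  date  = sample(1:28, 5000, replace = TRUE)
)

# Bin rides into one count per calendar day.
daily <- rides %>%
  group_by(month, date) %>%
  summarise(pickups = n(), .groups = "drop")

# Colors are listed from low to high values; wrapping the vector in
# rev() flips which end of the scale is red and which is blue.
heat_colors <- c("blue", "white", "red")

p <- ggplot(daily, aes(x = date, y = month, fill = pickups)) +
  geom_tile() +
  scale_fill_gradientn(colours = heat_colors)
```

Listing the colors in low-to-high order is what makes busy days render in red; reversing the vector with `rev()` is the simplest fix when the gradient comes out backwards.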

Pickups per hour by Day Scatterplots

The above graphs show the distribution of rides per hour for all dates in the 6-month time period. The top graph is colored by day, which highlighted how demand rose along a gradient from Sunday (yellow) through Saturday (dark red). I threw in a smoothing curve to show the (mean) average demand for all days. The bottom graph helps distinguish the two types of demand curves for Monday - Friday vs Saturday and Sunday.

Multivariate Analysis

These 4 graphs complement each other well to paint a clear picture of rider habits. The map scatterplot highlights the most popular locations while the colored pickups by hour scatterplots show how rider behavior changes over time. I think that Uber should share more of this type of data with drivers. It takes a lot of the guesswork out of driving and would eliminate a lot of regret over missed opportunities. Having a better understanding of rider behavior would help drivers plan a more concrete schedule.

I was surprised that February was so busy. Even though it is the shortest month, more people used the service each day on average than in March or April. It makes sense that Valentine’s Day was so busy, but I wonder what caused the uptick the following weekend (Feb 20th and 21st). I would have expected that after the busy Valentine’s Day weekend, demand would drop the following weekend. I tried to run a Google Trends query for the timeframe and region, but the results were inconclusive.

Final Plots and Summary

New York City Map Scatterplot

Description

I chose this plot because it gives an easy-to-understand visualization of the distribution of pickups by location. We can see from the map that Manhattan was by far the most popular region, followed by remote areas in Brooklyn and Queens. The dark red circle in the lower right corner is JFK Airport and the one at the top of Queens is LaGuardia. I coded each dot to represent the number of pickups using both size and color because it was more helpful than either attribute alone. I chose to hide the size guide because it seemed redundant. I also transformed the color and size to a log scale because the range of pickups was massive (between 2 and 460,732 pickups). Finally, I played around with the bias in colorRampPalette to stretch the range of colors for larger numbers. This helped highlight how the number of pickups seems to diminish the further you are from the Manhattan epicenter.
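
As a rough illustration of the log transform and the `bias` argument, the sketch below uses only base R. The endpoints 2 and 460732 come from the text; the intermediate pickup counts, the color names and the bias value of 2 are assumptions.

```r
# Pickup counts spanning the range mentioned above; the middle
# values are made up for illustration.
pickups <- c(2, 150, 4200, 98000, 460732)

# On a log10 scale the values are spread far more evenly.
log_pickups <- log10(pickups)

ramp_colors <- c("yellow", "orange", "red", "darkred")

# Default palette: colors evenly spaced along the ramp.
pal_even <- colorRampPalette(ramp_colors)(9)

# bias > 1 spaces the colors more widely at the high end, so large
# pickup counts get more distinguishable shades.
pal_biased <- colorRampPalette(ramp_colors, bias = 2)(9)
```

Comparing `pal_even` with `pal_biased` side by side is a quick way to pick a bias value before wiring the palette into the map.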

Monthly Pickups Colored by Day

Description

This plot gave me the most ideas when performing my exploratory data analysis. It captured a lot of the subtle and not-so-subtle patterns along weekly and monthly cycles. I colored the bars by day because it highlighted how demand rose throughout the week and edited the color palette so that all bars were clearly visible. Finally, I changed the labels from scientific notation to decimal notation to make them more human-readable.
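
One common way to replace scientific notation on a ggplot2 axis is the `comma` labeller from the scales package, sketched below. Whether the original plot used this labeller is an assumption, and the monthly totals are rounded placeholders rather than the real counts.

```r
library(ggplot2)

# Placeholder monthly totals, roughly in the range described earlier.
monthly <- data.frame(
  month   = factor(c("Jan", "Jun"), levels = c("Jan", "Jun")),
  pickups = c(1900000, 2800000)
)

# scales::comma prints 2800000 as "2,800,000" instead of 2.8e+06.
p <- ggplot(monthly, aes(x = month, y = pickups)) +
  geom_col() +
  scale_y_continuous(labels = scales::comma)

example_label <- scales::comma(3000000)
```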

Pickups Per Hour with Daily Median Lines

Description

The Pickups Per Hour with Daily Median Lines graph was interesting because it showed key differences between weekdays and weekends. There are two distinct demand curves at play here. Saturday and Sunday ridership decreased from midnight to 6am, then steadily rose throughout the rest of the day. It also consistently started higher at midnight because people usually go out more on Friday and Saturday nights. Monday through Friday demand decreased from midnight to 3am, then abruptly rose as people started waking up to get to work. It was interesting to see how tight and consistent the 8am rush was, as opposed to the 6pm rush.
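
The median lines could be computed along the lines of the sketch below, which splits days into weekday and weekend groups and takes the median pickups per hour within each group. The `hourly` table, its values and the two-way weekday/weekend split are assumptions; the original plot may have drawn one median line per day of the week instead.

```r
library(dplyr)

# Made-up hourly totals for a Friday, a Saturday and a Monday.
hourly <- tibble(
  date    = rep(as.Date(c("2015-05-15", "2015-05-16", "2015-05-18")),
                each = 2),
  hour    = rep(c(8, 23), times = 3),
  pickups = c(6000, 7000, 2500, 9000, 5800, 3000)
)

# "%u" gives the ISO weekday number (Monday = 1, ..., Sunday = 7),
# which avoids locale-dependent day names.
medians <- hourly %>%
  mutate(day_type = if_else(format(date, "%u") %in% c("6", "7"),
                            "weekend", "weekday")) %>%
  group_by(day_type, hour) %>%
  summarise(median_pickups = median(pickups), .groups = "drop")
```

The resulting `medians` table has one row per hour and day type, which can be drawn with `geom_line` on top of the per-date scatter points.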


Reflection

This was an eye-opening project for me. As a former Uber driver I was curious to see how demand changed over time and how it varied by location. It is important to know the market in order to turn a profit. Being in the right place at the right time can be the difference between a $2,000 week and a $200 one. It was fun testing my intuitions against the data even when I was wrong. I had no idea that a Thursday or even a Wednesday could be nearly as profitable as a Friday under the right conditions. I was also surprised to learn that demand spiked so much around 6pm and that there really wasn’t much of a lunch rush Monday-Friday.

People need data to make more informed choices. When LinkedIn started sharing even simple data (who’s looked at your profile and profile tips), their popularity skyrocketed. This seems like a great opportunity to empower more drivers. Helping drivers perform their jobs more effectively could be a win for all sides. If more drivers were aware of the cyclical patterns in their area they would be better equipped to maximize their time and profit. Drivers would be more satisfied, leading to lower turnover, and passengers would be better served by more experienced (and perhaps friendlier) drivers.

The biggest struggle was trying to capture as much information from each plot as efficiently as possible. There were many graphs I had to throw away because they really didn’t contribute much to the report. I also came across a few roadblocks while using ggmap. For instance, I originally tried to make a 2d histogram using stat_density2d but was unable to because there were too many observations. All in all, I was very happy to go through this and I feel like the tools developed here can be applied to a wide variety of data exploration projects.

Sources of data:

  1. Uber: https://github.com/fivethirtyeight/uber-tlc-foil-response
  2. ggmap: D. Kahle and H. Wickham. ggmap: Spatial Visualization with ggplot2. The R Journal, 5(1), 144-161. URL http://journal.r-project.org/archive/2013-1/kahle-wickham.pdf